In this section, we’re going to use California state data to build various models and compare their performance.

The dataset has already been properly partitioned, so now let’s build some models.

Load packages and read data

library(tidyverse)
library(kableExtra)
library(scales)
library(caret)
library(modelr)
library(ROSE)
library(glmnet)
library(rpart)
library(rpart.plot)
library(randomForest)
library(plotly)

Sampling

Just one more step before we start building models. Grouping the four severity levels into two has made the dataset more balanced, but as the plot below shows, the two levels still contain unequal numbers of records.

This is not a big issue in itself, but balanced data makes the model easier to train and the final accuracy easier to interpret (both sensitivity and specificity must be high to achieve a high overall accuracy). So, let’s apply some sampling techniques to balance the data.

Here, both oversampling and undersampling are used to balance the data. Applying these sampling techniques also reduces the dataset to a size that is easier to work with.

new_train <- ovun.sample(Status ~ ., 
                         data = train_set, 
                         method = "both", p = 0.5, N = 90000, seed = 1)$data %>% as_tibble()
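A quick sanity check, assuming `new_train` was created as above: tabulate the resampled classes. With p = 0.5 and N = 90000, each severity level should hold roughly 45,000 records.

```r
# Count records per severity level after resampling;
# ovun.sample() with p = 0.5 should give a near 50/50 split.
table(new_train$Status)
```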

Logistic regression

Since our response variable now has two levels, “Severe” and “Not Severe”, logistic regression is a reasonable baseline model.

To find the best formula for logistic regression, we can apply stepwise model selection. Here we cannot use the root-mean-square error as our criterion; instead we use an information criterion such as AIC or BIC. In general, BIC penalizes model complexity more heavily, so a formula selected by BIC will contain fewer predictors than one selected by AIC. There is no definitive answer as to which is better, so we simply use AIC here.

# use step() to apply stepwise model selection
model_aic <- glm(Status ~ ., data = new_train, family = "binomial")
model_aic <- step(model_aic)
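For comparison, a sketch of the BIC-based alternative: `step()` uses AIC by default (penalty k = 2), and setting k = log(n) switches the penalty to BIC.

```r
# Stepwise selection under BIC: pass k = log(n) to step().
n <- nrow(new_train)
model_bic <- step(glm(Status ~ ., data = new_train, family = "binomial"),
                  k = log(n), trace = 0)
```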

These variables are dropped:

Variable to drop         AIC
Day                      101128.8
Temperature              101126.8
Sunrise_Sunset           101124.8
Nautical_Twilight        101122.9
Astronomical_Twilight    101121.1
Civil_Twilight           101119.9
Humidity                 101118.7

The final formula based on AIC value:

## glm(formula = Status ~ TMC + Year + Month + Hour + Wday + Duration + 
##     Start_Lat + Start_Lng + Distance + Side + Pressure + Wind_Speed + 
##     Weather_Condition + Junction + Traffic_Signal, family = "binomial", 
##     data = new_train)

Make predictions on the validation dataset. Here, we choose 0.6 as the cutoff (the threshold for converting predicted probabilities into class labels) to gain a higher total accuracy.
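The prediction step might look like the sketch below; the name `valid_set` and the factor level order are assumptions, and the 0.6 cutoff is applied to the predicted probability of the second factor level (assumed here to be “Severe”).

```r
# Predicted probabilities on the validation set, then apply the 0.6 cutoff.
pred_prob  <- predict(model_aic, newdata = valid_set, type = "response")
pred_class <- factor(ifelse(pred_prob > 0.6, "Severe", "Not Severe"),
                     levels = levels(valid_set$Status))
confusionMatrix(pred_class, valid_set$Status)
```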

We can inspect the performance of logistic regression using a confusion matrix.

Accuracy    Sensitivity  Specificity  Positive term
0.7154623   0.7352326    0.6754223    Not Severe
## Confusion Matrix and Statistics
## 
##             
##              Not Severe Severe
##   Not Severe      64002  13951
##   Severe          23048  29031
##                                          
##                Accuracy : 0.7155         
##                  95% CI : (0.713, 0.7179)
##     No Information Rate : 0.6695         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.3898         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.7352         
##             Specificity : 0.6754         
##          Pos Pred Value : 0.8210         
##          Neg Pred Value : 0.5574         
##              Prevalence : 0.6695         
##          Detection Rate : 0.4922         
##    Detection Prevalence : 0.5995         
##       Balanced Accuracy : 0.7053         
##                                          
##        'Positive' Class : Not Severe     
## 

From the result above, the performance of plain logistic regression is not satisfactory. Let’s try a different model.

Sparse logistic regression

Since we have many variables in our dataset, it’s likely that some coefficients would be near zero in the best final model. So, we try a sparse logistic model next.

Sparse logistic regression uses a “lasso” penalty: as the tuning parameter \(\lambda\) increases, more coefficients are forced to exactly zero. The plot below shows how the coefficients change along the regularization path.

x <- model.matrix(Status ~ ., data = new_train)
model_total <- glmnet(x, new_train$Status, family = "binomial")
plot(model_total)

To get the best sparse logistic model, we need to find the best tuning parameter \(\lambda\). Cross-validation is used to find it here.

model_lambda <- cv.glmnet(x, new_train$Status, family = "binomial")
plot(model_lambda)

With the best tuning parameter \(\lambda\), we then build the model and make predictions on the validation dataset. We also set the cutoff as 0.6 to gain a higher total accuracy.
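A sketch of that step, again assuming the validation split is named `valid_set`; a `cv.glmnet` object can predict directly at the cross-validated `lambda.min`.

```r
# Build the design matrix for the validation data and predict at lambda.min.
x_valid <- model.matrix(Status ~ ., data = valid_set)
prob    <- predict(model_lambda, newx = x_valid,
                   s = "lambda.min", type = "response")
pred    <- ifelse(prob > 0.6, "Severe", "Not Severe")  # same 0.6 cutoff
```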

Accuracy    Sensitivity  Specificity  Positive term
0.7138375   0.7277552    0.6856505    Not Severe
## Confusion Matrix and Statistics
## 
##             
##              Not Severe Severe
##   Not Severe      63354  13512
##   Severe          23700  29472
##                                           
##                Accuracy : 0.7138          
##                  95% CI : (0.7114, 0.7163)
##     No Information Rate : 0.6695          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.39            
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7278          
##             Specificity : 0.6857          
##          Pos Pred Value : 0.8242          
##          Neg Pred Value : 0.5543          
##              Prevalence : 0.6695          
##          Detection Rate : 0.4872          
##    Detection Prevalence : 0.5911          
##       Balanced Accuracy : 0.7067          
##                                           
##        'Positive' Class : Not Severe      
## 

Compared to the plain logistic model, the performance is similar, so sparse logistic regression does not improve predictions in this case. We still need to explore other models.

Decision trees

Tree-based algorithms can do a good job in classification. They also have a built-in feature-selection process, which means we don’t need to be very picky about our variables.

Next, we are going to try decision trees, a very useful algorithm with a readily comprehensible concept.

model_decision <- rpart(Status ~ ., data = new_train, method = "class", minsplit = 20, cp = 0.001)

Usually we would plot the decision tree to inspect its nodes. But to reach a higher accuracy we set cp = 0.001, which lets the tree grow on many variables and makes the final tree too complex to plot legibly.

rpart.plot(model_decision, box.palette = "RdBu", shadow.col = "grey")

After we build the tree, let’s make predictions on the validation dataset.
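A minimal sketch of the prediction step, assuming the validation split is `valid_set`:

```r
# Class predictions from the fitted tree, then a confusion matrix via caret.
pred_tree <- predict(model_decision, newdata = valid_set, type = "class")
confusionMatrix(pred_tree, valid_set$Status)
```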

Accuracy    Sensitivity  Specificity  Positive term
0.8525123   0.8523101    0.8529220    Not Severe
## Confusion Matrix and Statistics
## 
##             
##              Not Severe Severe
##   Not Severe      74197   6322
##   Severe          12857  36662
##                                           
##                Accuracy : 0.8525          
##                  95% CI : (0.8506, 0.8544)
##     No Information Rate : 0.6695          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6791          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8523          
##             Specificity : 0.8529          
##          Pos Pred Value : 0.9215          
##          Neg Pred Value : 0.7404          
##              Prevalence : 0.6695          
##          Detection Rate : 0.5706          
##    Detection Prevalence : 0.6192          
##       Balanced Accuracy : 0.8526          
##                                           
##        'Positive' Class : Not Severe      
## 

From the result above, the decision tree clearly outperforms the two logistic regression models. What’s more, it takes much less time to train than the logistic models.

So far, decision tree is the best model we have.

Random forest

A well-known disadvantage of decision trees (not especially pronounced here) is overfitting: high accuracy on the training dataset but much lower accuracy on the test dataset.

Random forests alleviate this overfitting by combining bootstrap aggregation (“bagging”) with random feature selection at each split. Analyzing the final out-of-bag (OOB) error rate also gives a more realistic estimate of model performance.

Let’s see if a random forest can improve the accuracy further.

model_rf <- randomForest(Status ~ ., data = new_train, mtry = 6, ntree = 500)

Two arguments are particularly important here:

Name    Description
mtry    Number of variables randomly sampled as candidates at each split
ntree   Number of trees to grow

To train a better random forest model, we need to find proper values for these two arguments.

Use the plot below to see whether ntree = 500 is enough:
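The plot referenced here can be produced directly from the fitted model; the randomForest package’s plot method draws the OOB error (and per-class errors) against the number of trees grown.

```r
# OOB error rate as a function of the number of trees.
plot(model_rf, main = "OOB error vs. number of trees")
```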

As we can see, the error rate levels off as the number of trees increases. So, ntree = 500 is enough.

Let’s find out the best value for mtry:
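One way to do this, sketched below, is to sweep candidate mtry values and compare the final OOB error of each fit; the candidate grid and the smaller ntree (for speed) are assumptions.

```r
# Sweep mtry candidates; keep the final OOB error rate for each fit.
mtry_grid <- c(4, 6, 10, 14, 18, 22)
oob_err <- sapply(mtry_grid, function(m) {
  fit <- randomForest(Status ~ ., data = new_train, mtry = m, ntree = 200)
  fit$err.rate[nrow(fit$err.rate), "OOB"]
})
plot(mtry_grid, oob_err, type = "b", xlab = "mtry", ylab = "OOB error")
```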

It’s clear that mtry = 18 gives the lowest error, so it should be the best choice.

Next, we use ntree = 500 and mtry = 18 to build the random forest model and make predictions on validation dataset.

Accuracy    Sensitivity  Specificity  Positive term
0.8849106   0.8701840    0.9147357    Not Severe
## Confusion Matrix and Statistics
## 
##             
##              Not Severe Severe
##   Not Severe      75753   3665
##   Severe          11301  39319
##                                           
##                Accuracy : 0.8849          
##                  95% CI : (0.8832, 0.8866)
##     No Information Rate : 0.6695          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7511          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8702          
##             Specificity : 0.9147          
##          Pos Pred Value : 0.9539          
##          Neg Pred Value : 0.7767          
##              Prevalence : 0.6695          
##          Detection Rate : 0.5825          
##    Detection Prevalence : 0.6107          
##       Balanced Accuracy : 0.8925          
##                                           
##        'Positive' Class : Not Severe      
## 

According to the result above, the random forest does improve on the decision tree’s accuracy. However, training and tuning the random forest takes far longer than training a decent decision tree.

Conclusion

In conclusion, considering both performance and the time needed to train the model, I prefer the decision tree for making predictions. But if we care about nothing but accuracy, then the random forest is the winner.

The table below contains the performance of each model:

Model                        Accuracy    Sensitivity  Specificity
Logistic Regression          0.7154623   0.7352326    0.6754223
Sparse Logistic Regression   0.7138375   0.7277552    0.6856505
Decision Tree                0.8525123   0.8523101    0.8529220
Random Forest                0.8849106   0.8701840    0.9147357